This is the manual for RaiskaaHTML, a tool to strip junk from HTML documents. This document describes version 1.1, released 20.10.98.
As the technology behind Internet and WWW is an incredible piece of crap, it is quite painful for the user to access these resources online.
Nevertheless it offers a basically easy to use and cheap way to obtain information that would be difficult or expensive to get from somewhere else. Therefore the only reasonable application of WWW is to download everything of use to one's hard disk. This can be done efficiently with recursive download tools like wget. To make their usage more painless, browser frontends like WgetRexx exist.
Still the problem remains that these documents often use dreadful colors and layout, not only looking ugly but also wasting space. Although most browsers offer the option to ignore the document-provided colors, this usually affects all documents.
This is where RaiskaaHTML comes in: it can remove all sorts of junk from HTML documents on your local storage, including annoying colors, invisible meta tags, useless scripts, bloated comments and redundant white space.
To successfully run RaiskaaHTML, you need at least AmigaOS 2.04 with ARexx installed and running.
Furthermore rexxdossupport.library has to be installed. RexxDosSupport is copyright by Hartmut Goebel and can be downloaded from aminet:util/rexx/rexxdossupport.lha.
For fully understanding all possible command line options, some understanding of HTML is required. For using only the basic functions, this is hopefully not necessary.
As long as the HTML code you process is correct, you won't get any syntax error messages. RaiskaaHTML is quite tolerant and can cope with most HTML code, even if it contains some minor errors - it does not have to be completely SGML conformant. With serious errors however it refuses to convert the document. You are then supposed to fix the problem and retry. For that you will again need some understanding of HTML.
If you want error messages to show up in the ScMsg message browser instead of in the console, sc:c/scmsg has to be installed; it is part of the SAS/c compiler package. This is completely optional though.
There are two files you have to take care of: RaiskaaHTML.rexx and TodellakinRaiskaaHTML.
To install all this, copy TodellakinRaiskaaHTML to some directory within your Workbench search path and store RaiskaaHTML.rexx wherever you put your supporting scripts. You can use the following commands in CLI:
cd where ever you extracted the archive/RaiskaaHTML
copy RaiskaaHTML.rexx TO rexx:
copy TodellakinRaiskaaHTML TO c:
Preferably you should also store this manual in some place where you can find it again, for instance in Help:RaiskaaHTML.
For a quick example, check example/fancy.html coming with this archive: it contains a dreadful looking web page as they are often done by insane web authors. Now enter in CLI:
rx RaiskaaHTML.rexx example/fancy.html to ram:fancy.html blink color
Load the resulting ram:fancy.html in your browser and notice the difference. Isn't it a relief?
If you prefer to use RaiskaaHTML from CLI, you might want to add the following line to your s:Shell-Startup:
alias raiskaa rx RaiskaaHTML.rexx blink color []
Feel free to replace "blink color" by whatever options fit you.
Command line options are discussed in more detail later. Here is a short explanation of some switches you can specify to influence the output document:
The below options might cause problems in certain documents. Use them with care.
Furthermore it is possible to specify a directory as From and to overwrite documents by not specifying To.
As no reasonable thinking human being remembers command line options, you might also be interested in learning how to integrate RaiskaaHTML into Directory Opus and your web browser. But before that, there is one thing to discuss that is easier to explain in CLI:
As a matter of fact, many HTML documents contain syntax errors.
RaiskaaHTML does not parse the HTML code very exactly, but still requires the basic tag/attribute structure to be intact. If this is not the case, it displays a message briefly describing the problem. There are two classes of messages: warnings and errors.
A warning points out a minor problem that can be worked around. For example, if the document contains a ">" sign that is not used to denote the end of a tag, it is replaced by "&gt;" in the output. Still, the actual problem might have occurred earlier, and the output might not look the way you want it to.
An error however is too difficult for RaiskaaHTML to fix. Therefore you have to do it manually.
You can find an example document containing a minor problem in example/warning.html. Let's see what happens if you type the following in CLI:
rx RaiskaaHTML.rexx example/warning.html to ram:test.html
This results in the following output:
example/warning.html: 7, 20: warning: unmatched ">"
Even though the document was not correct, ram:test.html could be created.
If you are wondering what those two numbers mean: 7 denotes the line and 20 the column in the input file where the problem was encountered. This information comes in handy if the document contains serious errors, such as the included example/error.html. Try this in CLI:
rx RaiskaaHTML.rexx example/error.html to ram:test.html
This time, no output is written and the following message is displayed:
example/error.html: 12, 27: error: "*" is not an HTML attribute
If you take a look at example/error.html, you should notice that the "<" in line 12 has to be changed to "&lt;". Load the document into your editor and fix the problem. After that you can start RaiskaaHTML again and it should accept the data without any further whining.
Maybe you think loading the documents and moving to the requested line is a cumbersome task. Not if you have the SAS/c message browser installed. In this case, repeat the last example with ScMsg enabled:
rx RaiskaaHTML.rexx example/error.html to ram:test.html ScMsg
Then click in the browser and wait for your editor to load the document and jump at least to the proper line. (Unfortunately the message browser is too stupid to deal with columns.)
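If you do not have ScMsg, the messages shown above are regular enough to be picked apart by a small script, for example to hand the line number to your editor. The following is only a sketch in Python, not part of RaiskaaHTML; the name parse_message is made up for this illustration, and the pattern assumes that every message looks like the two examples in this manual.

```python
import re

# Pattern for messages such as:
#   example/warning.html: 7, 20: warning: unmatched ">"
# The file part is matched lazily because Amiga paths may contain ":".
# Assumption: all messages follow the layout of the examples in this manual.
MESSAGE = re.compile(
    r"^(?P<file>.+?): (?P<line>\d+), (?P<column>\d+): "
    r"(?P<kind>warning|error): (?P<text>.*)$"
)


def parse_message(message):
    """Split one message line into its parts; return None if it does not match."""
    match = MESSAGE.match(message)
    if match is None:
        return None
    return {
        "file": match.group("file"),
        "line": int(match.group("line")),
        "column": int(match.group("column")),
        "kind": match.group("kind"),
        "text": match.group("text"),
    }
```

The "kind" entry tells you whether an output document was written (warning) or not (error).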
As the purpose of RaiskaaHTML is not to help you fix errors in HTML documents, the error handling mechanisms are not very sophisticated and the resulting messages might not be of much help to you. In such a case, use a real syntax checker before trying to run RaiskaaHTML again.
Also note that the fact that a document was accepted by RaiskaaHTML does not tell much about its correctness because the program hardly cares about anything. It only tests for certain tags and attributes to be stripped and looks for the <pre> tag and SGML comments.
Of course you can assign RaiskaaHTML to a menu, button or whatever in Directory Opus. When the function editor pops up, specify the following command:
Type | Command |
ARexx | RaiskaaHTML.rexx {o} blink color |
Flags | CD source, Do all files, Output to window, Rescan source, Window close button |
Feel free to replace "blink color" by whatever options fit you.
This way you can select multiple files and whole directories containing HTML documents. When started, a window opens where RaiskaaHTML displays its progress status.
If you have a browser that allows you to create ARexx macros, you can assign RaiskaaHTML to a button or menu.
RaiskaaHTML also accepts the URI format for the From parameter. Naturally it can only deal with URIs of type file://localhost/ because it does not have any network functions implemented. This should not be a problem as you do not have write access to documents written by other people anyway, so the document written by RaiskaaHTML has to end up on your hard disk in any case.
Refer to the manual of your browser to find out how to pass the URI of the document currently being browsed as a command line parameter to scripts. In general, it should look something like this:
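To illustrate what accepting only file://localhost/ URIs amounts to, here is a rough sketch in Python. It is not taken from the script; the function name uri_to_path is made up, and decoding of %xx escapes is an assumption about what a browser might pass.

```python
from urllib.parse import unquote

# The only kind of URI that can be mapped to a local file.
PREFIX = "file://localhost/"


def uri_to_path(uri):
    """Turn a file://localhost/ URI into a local path, refuse anything else."""
    if not uri.lower().startswith(PREFIX):
        raise ValueError("only file://localhost/ URIs can be handled: " + uri)
    # Assumption: the browser may escape characters such as blanks as %20.
    return unquote(uri[len(PREFIX):])
```

For example, file://localhost/Work:html/index.html would become Work:html/index.html, while any http:// URI is rejected.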
Macro | Command |
Raiskaa! | rx RaiskaaHTML.rexx %u blink color |
Feel free to replace "blink color" by whatever options fit you.
This way you can process only one document at a time.
If your browser does not open a console automatically, you have to redirect the output manually, e.g. by appending ">con:////RaiskaaHTML/CLOSE/WAIT" to the above command.
The From option specifies the HTML document you want to convert and is required.
If you do not specify a file but a directory, the directory and all its subdirectories are scanned for files with the suffixes .html and .htm. This is useful if you want to convert a whole downloaded site with the same set of options.
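The directory scan can be pictured roughly like the following Python sketch. It only illustrates the idea and is not how the ARexx script is written; find_html_documents is a made-up name, and ignoring upper/lower case in the suffix is an assumption.

```python
import os

# Suffixes that mark a file as an HTML document.
SUFFIXES = (".html", ".htm")


def find_html_documents(directory):
    """Collect all HTML documents in a directory and its subdirectories."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            # Assumption: the suffix is compared without regard to case.
            if name.lower().endswith(SUFFIXES):
                found.append(os.path.join(root, name))
    return sorted(found)
```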
With the Toggle switch enabled, all HTML options described below are toggled. That means all such switches are enabled except those you specified.
If you specify other switches together with Toggle, they are disabled. For example Toggle DocType removes everything possible but preserves the document type declaration.
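In other words, Toggle turns the switches you name into the complement of the full set. A small Python sketch of that idea follows; it is not the script's code, effective_switches is a made-up name, and the set below simply lists the switches that appear in this manual, which may not match the script's actual list.

```python
# Assumption: the switches named in this manual; the real list may differ.
HTML_SWITCHES = frozenset([
    "blink", "color", "doctype", "font", "heading", "linefeed", "link",
    "meta", "ruler", "script", "sgml", "space", "table",
])


def effective_switches(specified, toggle=False):
    """Return the set of switches that end up enabled."""
    specified = {name.lower() for name in specified}
    if toggle:
        # Everything is enabled except what was named on the command line.
        return set(HTML_SWITCHES) - specified
    return specified
```

With this, effective_switches(["DocType"], toggle=True) contains every switch except doctype.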
With the DocType switch enabled, a possible document type declaration in the first line of the document is removed. This saves some space, but you won't be able to use any SGML parser on the document.
So you better enable this only if you really know what you are doing.
See also: SGML
With the Font switch enabled, all font faces specified with <font face=..> are removed. This does not affect the font size specified with <font size=..>, as this usually increases the legibility of a document.
With the Heading switch enabled, all heading alignments are removed. The headings themselves remain of course.
With the Linefeed switch enabled, the output document uses a simple linefeed character to separate lines.
For documents created on MS-DOS based systems, this saves you some space (one byte per line) because the carriage return is removed. For documents created on MacOS, this saves you some trouble because every carriage return is replaced by a linefeed.
For documents created on Amiga or Unix systems, this switch has no effect - but also does no harm.
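The effect of the Linefeed switch on the three kinds of documents can be sketched in a few lines of Python. This is an illustration only, not the script's code, and normalize_linefeeds is a made-up name.

```python
def normalize_linefeeds(text):
    """Make every line end with a single linefeed character."""
    # MS-DOS: carriage return + linefeed becomes a plain linefeed.
    text = text.replace("\r\n", "\n")
    # MacOS: any remaining lone carriage return becomes a linefeed.
    return text.replace("\r", "\n")
```

The order matters: if lone carriage returns were replaced first, every MS-DOS line end would turn into two linefeeds.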
With the Link switch enabled, all <link> tags are removed. This does not mean that all links in the documents are removed; such links would be done with <a> (anchor).
<link> is only used to specify "hidden" links in the document that usually contain relationship information to other documents. For example, a document could contain the following tag:
<link rel="next" href="chapter17.html">
But you will not see anything unless your browser does something special about it. This way, the link tag could easily be used to provide a toolbar with Previous, Next, Contents and similar buttons, as AmigaGuide has always had.
As a matter of fact, most web browsers are far too stupid to do anything useful with this tag, so it is completely useless in practice. Sad to say.
See also: Meta
With the Meta switch enabled, all <meta> tags are removed. The meta tag is to some extent comparable with <link>, as it can store various invisible information in the document.
It is commonly used to support search engines in finding out what the document is about. An example usage could be
<meta name="keywords" content="html, junk, crap, strip, remove">
In a local copy on your disk, there hardly is any use for this information.
See also: Link
With the Ruler switch enabled, horizontal rulers done with <hr> are displayed as simple bars. All silly images, useless widths and sizes are gone then.
Note however that it is also common to use images as vertical bars by means of the <img> tag. These naturally can't be "normalized".
With the Script switch enabled, all scripting tags are removed, and scripting attributes are removed from all other tags. This refers to the <script> and <noscript> tags and to attributes like onmouseover and onkeypress, which can be used within several other tags.
However, the scripts themselves are not removed as they are usually included in SGML comments. For example:
<script language=JavaScript>
<!--
document.write('Supports JavaScript.')
// -->
</script>
<noscript>
Does not support JavaScript.
</noscript>
With only Script enabled, this yields:
<!--
document.write('Supports JavaScript.')
// -->
Does not support JavaScript.
Note that the script source is still part of the document, but invisible as it is included in an SGML comment. With both Script and SGML enabled, the result is:
Does not support JavaScript.
See also: SGML
With the SGML switch enabled, all SGML comments except the document type declaration are removed.
This sounds more complicated than it actually is. When you look at the source of some documents, you might notice lines like this:
<!-- Todo: insert link to Hugo's homepage -->
But for some reason you can't see this text when you view the document in your browser. That is because the cryptic <!-- ... --> denotes a comment. As you can't see it, there usually is no reason to let these comments waste space.
However, SGML comments are commonly used to hide scripts from the browser. If the author of the document was dumb enough, you might not be able to navigate inside the document after removing those "commented out" scripts.
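To give an impression of what removing comments while keeping the document type declaration means, here is a simplified Python sketch. It is not the script's parser; strip_comments is a made-up name, and the pattern only covers the plain <!-- ... --> form, not every comment SGML allows.

```python
import re

# Assumption: comments always have the plain <!-- ... --> form.
# A declaration like <!DOCTYPE ...> does not start with "<!--" and is kept.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_comments(html):
    """Remove all <!-- ... --> comments, even those spanning several lines."""
    return COMMENT.sub("", html)
```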
With the Space switch enabled, some redundant white space is removed. This does not influence how the document appears in the browser.
Inside <pre> (for rendering preformatted text), this white space still is preserved.
Note that some blanks that could actually be removed remain. Finding these would make the parsing process much more difficult, resource consuming and slower.
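The idea of removing blanks everywhere except inside <pre> can be sketched in Python as follows. This is a deliberately crude illustration and not what the script does in detail: collapse_space is a made-up name, only runs of blanks and tabs are reduced, and quoted attribute values are not treated specially.

```python
import re

# A <pre> ... </pre> block; its content must be left untouched.
PRE_BLOCK = re.compile(r"(<pre\b.*?</pre\s*>)", re.IGNORECASE | re.DOTALL)


def collapse_space(html):
    """Reduce runs of blanks and tabs to one blank, except inside <pre>."""
    parts = PRE_BLOCK.split(html)
    result = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            # Odd entries are the <pre> blocks captured by the pattern.
            result.append(part)
        else:
            result.append(re.sub(r"[ \t]{2,}", " ", part))
    return "".join(result)
```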
With the Table switch enabled, background colors of tables are removed.
The table itself is of course preserved.
With the OnError option you can specify what to do about documents containing faulty HTML code. Possible values are (upper/lower case does not matter):
The default value is Ask.
With the Ignore switch enabled, warnings about faulty HTML code are ignored and the document is still written.
Use this switch with caution when processing directories as there is no way to restore the original document once it has been overwritten.
With the ScMsg switch enabled, parser related error messages are no longer displayed in the console but sent to the ScMsg message browser.
ScMsg is part of the SAS/c compiler package. Naturally it has to be installed before using this switch. If the message browser is not already running, the script will start it automatically (expecting it to be in sc:c/ScMsg).
RaiskaaHTML is copyright 1998 by Thomas Aglassinger. All rights reserved.
RaiskaaHTML is freeware. You can use it without having to pay and you can freely redistribute it as long as all files coming with the archive are preserved and no files are added or removed.
You use this material at your own risk. No responsibilities are taken for trashed HTML documents, damaged Amigas or any other components or data involved while using RaiskaaHTML.
New versions of RaiskaaHTML are uploaded to aminet:comm/www/RaiskaaHTML.lha. Check aminet:comm/www/RaiskaaHTML.readme to find out if there is one.
Suggestions are not really welcome. This tool does what I want it to do, extending it hardly makes sense. See also Future.
Bug reports can be sent to Thomas Aglassinger <agi@sbox.tu-graz.ac.at>.
Basically I don't want to change much about it. That means I'm not thinking about adding a configuration file where the user can specify which attributes to strip from which tag or such things. Although this would be more flexible, it would mostly make the program more difficult to use.
If a <font face=..> is reduced to <font>, the whole tag could be stripped. This would require implementing a stack, which is pretty dull to do.
One thing that would be interesting is to add a GUI, preferably by using MUIRexx. Not that I plan to do it, but if somebody else improves the ARexx script, I will happily include these changes. This would make the script much more efficient to use from the browser or DOpus, as one could then specify a static set of options and change them on the fly depending on the files to process.